New York City Taxi Fare Prediction Playground Competition

Many Thanks to Albert van Breemen who created a great notebook, which I forked to use as a starting point for my own data exploration and file prep: https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration

Reading data and first exploration

In [1]:
## Load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
plt.style.use('seaborn-whitegrid')
In [2]:
## Read data in pandas dataframe, using nrows for exploration, then chunksize for later processing.
train_df =  pd.read_csv('./datafiles/train.csv', nrows=2000000, parse_dates=["pickup_datetime"])

## List first few rows (datapoints)
train_df.head()
Out[2]:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2009-06-15 17:26:21.0000001 4.5 2009-06-15 17:26:21 -73.844311 40.721319 -73.841610 40.712278 1
1 2010-01-05 16:52:16.0000002 16.9 2010-01-05 16:52:16 -74.016048 40.711303 -73.979268 40.782004 1
2 2011-08-18 00:35:00.00000049 5.7 2011-08-18 00:35:00 -73.982738 40.761270 -73.991242 40.750562 2
3 2012-04-21 04:30:42.0000001 7.7 2012-04-21 04:30:42 -73.987130 40.733143 -73.991567 40.758092 1
4 2010-03-09 07:51:00.000000135 5.3 2010-03-09 07:51:00 -73.968095 40.768008 -73.956655 40.783762 1
In [3]:
## Check datatypes
train_df.dtypes
Out[3]:
key                          object
fare_amount                 float64
pickup_datetime      datetime64[ns]
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
dtype: object
In [4]:
## Check statistics of the features
train_df.describe()
Out[4]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 2.000000e+06 2.000000e+06 2.000000e+06 1.999986e+06 1.999986e+06 2.000000e+06
mean 1.134779e+01 -7.252321e+01 3.992963e+01 -7.252395e+01 3.992808e+01 1.684113e+00
std 9.852883e+00 1.286804e+01 7.983352e+00 1.277497e+01 1.032382e+01 1.314982e+00
min -6.200000e+01 -3.377681e+03 -3.458665e+03 -3.383297e+03 -3.461541e+03 0.000000e+00
25% 6.000000e+00 -7.399208e+01 4.073491e+01 -7.399141e+01 4.073400e+01 1.000000e+00
50% 8.500000e+00 -7.398181e+01 4.075263e+01 -7.398016e+01 4.075312e+01 1.000000e+00
75% 1.250000e+01 -7.396713e+01 4.076710e+01 -7.396369e+01 4.076809e+01 2.000000e+00
max 1.273310e+03 2.856442e+03 2.621628e+03 3.414307e+03 3.345917e+03 2.080000e+02

First observations

  • There are negative fares. And while they may have a good explanation for why they're in the dataset (such as due to refunds), I don't think it's reasonable to expect that a significant portion of the test dataset would feature negative fares. And, since we've got a gigantic dataset, there's little harm in dropping any data that contains unhelpful information.
    • Furthermore, in researching the NYC Taxi rates, I see that that standard base charge for any taxi is $2.50, so I'm dropping everything from the dataset that's less than that.
  • Is there any missing data? If so, how many rows are affected? Does the test set contain missing data?
  • Is it possible that some of the Latitude and Longitude coordinates are swapped? Or is everything that is outside the scope of the city just junk data?
  • How can there be 0 passengers?
  • To be useful for exploration and prediction, I'll need to convert the dates from a timestamp object to individual colums for different aspects like, hour of the day, month, year, etc.

Compare test data

Load the test data to sanity check any changes to the train set.

In [5]:
## Read data in pandas dataframe
test_df =  pd.read_csv('./datafiles/test.csv', parse_dates=["pickup_datetime"])
test_df.head(5)
Out[5]:
key pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2015-01-27 13:08:24.0000002 2015-01-27 13:08:24 -73.973320 40.763805 -73.981430 40.743835 1
1 2015-01-27 13:08:24.0000003 2015-01-27 13:08:24 -73.986862 40.719383 -73.998886 40.739201 1
2 2011-10-08 11:53:44.0000002 2011-10-08 11:53:44 -73.982524 40.751260 -73.979654 40.746139 1
3 2012-12-01 21:12:12.0000002 2012-12-01 21:12:12 -73.981160 40.767807 -73.990448 40.751635 1
4 2012-12-01 21:12:12.0000003 2012-12-01 21:12:12 -73.966046 40.789775 -73.988565 40.744427 1
In [6]:
test_df.describe()
Out[6]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 9914.000000 9914.000000 9914.000000 9914.000000 9914.000000
mean -73.974722 40.751041 -73.973657 40.751743 1.671273
std 0.042774 0.033541 0.039072 0.035435 1.278747
min -74.252193 40.573143 -74.263242 40.568973 1.000000
25% -73.992501 40.736125 -73.991247 40.735254 1.000000
50% -73.982326 40.753051 -73.980015 40.754065 1.000000
75% -73.968013 40.767113 -73.964059 40.768757 2.000000
max -72.986532 41.709555 -72.990963 41.696683 6.000000

Remove all fares less than $2.50

In [7]:
def fix_fares(df, verbose=False):
    if verbose:
        print("Removing all fares less than $2.50:")
        old_size = len(df)
        print("Old size: {}".format(old_size))

    df = df[df.fare_amount >= 2.5]

    if verbose:
        new_size = len(df)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Dropped {} records, or {:.2f}%".format(difference, percent))
    
    return df

train_df = fix_fares(train_df, True)
Removing all fares less than $2.50:
Old size: 2000000
New size: 1999845
Dropped 155 records, or 0.01%

Remove missing data

Next up, lets check to see how much missing data exists in this dataset and compare it to the test set. Then we'll drop any rows with missing data from the training set, since it'd be better to use less data than to let bad data influence the model.

In [8]:
print("Train Dataset NaNs: {}".format(train_df.isnull().sum()))

print("Test Dataset NaNs: {}".format(test_df.isnull().sum()))
Train Dataset NaNs: key                   0
fare_amount           0
pickup_datetime       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude    14
dropoff_latitude     14
passenger_count       0
dtype: int64
Test Dataset NaNs: key                  0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64
In [9]:
def drop_nan(df, verbose=False):
    if verbose:
        print("Dropping all rows with NaNs:")
        old_size = len(df)
        print("Old size: {}".format(old_size))

    df.dropna(how ='any', axis ='rows', inplace=True)

    if verbose:
        new_size = len(df)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Dropped {} records, or {:.2f}%".format(difference, percent))
    
    return df

train_df = drop_nan(train_df, True)
Dropping all rows with NaNs:
Old size: 1999845
New size: 1999831
Dropped 14 records, or 0.00%

Investigate Lat and Lon outliers

First, lets check to see if the test set has any outliers.

In [10]:
## Minimum and maximum longitude test set
min(test_df.pickup_longitude.min(), test_df.dropoff_longitude.min()), \
max(test_df.pickup_longitude.max(), test_df.dropoff_longitude.max())
Out[10]:
(-74.263242000000005, -72.986531999999997)
In [11]:
## Minimum and maximum latitude test
min(test_df.pickup_latitude.min(), test_df.dropoff_latitude.min()), \
max(test_df.pickup_latitude.max(), test_df.dropoff_latitude.max())
Out[11]:
(40.568973, 41.709555000000002)
In [12]:
def select_outside_boundingbox(df, BB):
    filter_df = df.loc[(df['pickup_longitude'] < BB[0]) | (df['pickup_longitude'] > BB[1]) | \
           (df['pickup_latitude'] < BB[2]) | (df['pickup_latitude'] > BB[3]) | \
           (df['dropoff_longitude'] < BB[0]) | (df['dropoff_longitude'] > BB[1]) | \
           (df['dropoff_latitude'] < BB[2]) | (df['dropoff_latitude'] > BB[3])]
    
    return filter_df

NYC_BB = (-74.5, -72.8, 40.5, 41.8)
In [13]:
latlon_outliers = select_outside_boundingbox(train_df, NYC_BB)

print("Train Dataset outliers: {}".format(latlon_outliers.sum()))

latlon_outliers
Train Dataset outliers: key                  2012-12-24 11:24:00.000000982013-11-23 12:57:0...
fare_amount                                                     513967
pickup_longitude                                               -201939
pickup_latitude                                                68060.2
dropoff_longitude                                              -205892
dropoff_latitude                                               64691.2
passenger_count                                                  70452
dtype: object
Out[13]:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
11 2012-12-24 11:24:00.00000098 5.50 2012-12-24 11:24:00 0.000000 0.000000 0.000000 0.000000 3
15 2013-11-23 12:57:00.000000190 5.00 2013-11-23 12:57:00 0.000000 0.000000 0.000000 0.000000 1
26 2011-02-07 20:01:00.000000114 6.50 2011-02-07 20:01:00 0.000000 0.000000 0.000000 0.000000 1
124 2013-01-17 17:22:00.00000043 8.00 2013-01-17 17:22:00 0.000000 0.000000 0.000000 0.000000 2
192 2010-09-05 17:08:00.00000092 3.70 2010-09-05 17:08:00 0.000000 0.000000 0.000000 0.000000 5
233 2011-07-24 01:14:35.0000002 8.50 2011-07-24 01:14:35 0.000000 0.000000 0.000000 0.000000 2
273 2009-10-30 18:13:00.00000021 8.10 2009-10-30 18:13:00 0.000000 0.000000 0.000000 0.000000 4
357 2013-07-04 16:41:27.0000002 8.50 2013-07-04 16:41:27 0.000000 0.000000 0.000000 0.000000 1
376 2014-05-29 05:57:22.0000001 2.50 2014-05-29 05:57:22 0.000000 0.000000 0.000000 0.000000 1
387 2012-11-15 08:39:00.00000095 13.00 2012-11-15 08:39:00 0.000000 0.000000 0.000000 0.000000 1
472 2009-02-22 22:48:00.000000130 2.50 2009-02-22 22:48:00 0.000000 0.000000 -74.005433 40.726685 2
498 2009-03-16 04:47:17.0000001 17.00 2009-03-16 04:47:17 0.000000 0.000000 0.000000 0.000000 1
540 2013-10-29 01:09:14.0000002 16.00 2013-10-29 01:09:14 0.000000 0.000000 0.000000 0.000000 1
542 2009-12-29 14:23:00.000000158 4.50 2009-12-29 14:23:00 0.000000 0.000000 0.000000 0.000000 1
568 2014-10-08 06:49:18.0000001 6.50 2014-10-08 06:49:18 0.000000 0.000000 0.000000 0.000000 1
660 2010-08-18 03:46:00.0000002 13.30 2010-08-18 03:46:00 0.000000 0.000000 0.000000 0.000000 5
728 2010-02-13 01:52:11.0000003 11.70 2010-02-13 01:52:11 0.000000 0.000000 0.000000 0.000000 1
799 2013-12-08 23:39:00.00000080 3.00 2013-12-08 23:39:00 0.000000 0.000000 0.000000 0.000000 5
872 2011-09-25 02:07:00.000000139 8.50 2011-09-25 02:07:00 0.000000 0.000000 0.000000 0.000000 1
881 2011-07-19 23:43:12.0000001 8.50 2011-07-19 23:43:12 0.000000 0.000000 0.000000 0.000000 1
887 2014-09-26 19:41:41.0000004 12.00 2014-09-26 19:41:41 0.000000 0.000000 0.000000 0.000000 1
958 2011-02-22 14:09:00.00000066 2.50 2011-02-22 14:09:00 0.000000 0.000000 0.000000 0.000000 5
960 2013-05-31 05:02:30.0000001 12.50 2013-05-31 05:02:30 0.000000 0.000000 0.000000 0.000000 1
964 2009-06-30 10:20:09.0000005 14.10 2009-06-30 10:20:09 0.000000 0.000000 0.000000 0.000000 1
966 2015-06-08 01:59:17.0000001 28.00 2015-06-08 01:59:17 0.000000 0.000000 0.000000 0.000000 1
1181 2012-10-11 00:21:00.000000140 25.00 2012-10-11 00:21:00 -0.004093 0.033500 0.016852 0.017980 2
1260 2011-03-10 20:25:00.00000049 5.70 2011-03-10 20:25:00 -73.973907 40.754743 0.000000 0.000000 2
1314 2012-09-20 20:06:50.0000001 11.50 2012-09-20 20:06:50 0.000000 0.000000 0.000000 0.000000 1
1316 2014-06-17 17:50:00.000000213 8.50 2014-06-17 17:50:00 0.000000 0.000000 0.000000 0.000000 1
1397 2009-02-17 07:30:00.00000027 4.50 2009-02-17 07:30:00 0.000000 0.000000 0.000000 0.000000 2
... ... ... ... ... ... ... ... ...
1998581 2010-11-13 03:17:04.0000001 12.10 2010-11-13 03:17:04 0.000000 0.000000 0.000000 0.000000 1
1998625 2014-11-23 21:24:11.0000001 57.83 2014-11-23 21:24:11 0.000000 0.000000 0.000000 0.000000 1
1998638 2009-12-16 09:56:23.0000004 6.50 2009-12-16 09:56:23 0.000000 0.000000 0.000000 0.000000 1
1998654 2010-01-12 12:36:00.000000206 6.10 2010-01-12 12:36:00 0.000000 0.000000 0.000000 0.000000 3
1998789 2011-06-03 07:45:00.000000135 49.80 2011-06-03 07:45:00 0.000000 0.000000 0.000000 0.000000 1
1998809 2011-09-24 04:32:19.0000001 8.10 2011-09-24 04:32:19 0.000000 0.000000 0.000000 0.000000 1
1998984 2014-03-26 09:05:24.0000001 8.00 2014-03-26 09:05:24 0.000000 0.000000 0.000000 0.000000 1
1999029 2010-04-20 21:28:00.000000190 6.90 2010-04-20 21:28:00 0.000000 0.000000 0.000000 0.000000 1
1999057 2014-12-22 20:55:54.0000001 11.50 2014-12-22 20:55:54 0.000000 0.000000 0.000000 0.000000 1
1999104 2011-05-27 18:00:00.000000168 5.70 2011-05-27 18:00:00 0.000000 0.000000 0.000000 0.000000 1
1999118 2012-05-11 00:17:51.0000001 19.30 2012-05-11 00:17:51 0.000000 0.000000 0.000000 0.000000 1
1999301 2010-05-01 16:08:00.000000208 8.50 2010-05-01 16:08:00 0.000000 0.000000 0.000000 0.000000 6
1999318 2012-09-09 01:48:10.0000004 4.00 2012-09-09 01:48:10 0.000000 0.000000 0.000000 0.000000 2
1999335 2014-05-02 17:06:00.00000017 4.00 2014-05-02 17:06:00 0.000000 0.000000 0.000000 0.000000 1
1999344 2014-09-08 15:53:00.00000056 7.50 2014-09-08 15:53:00 0.000000 0.000000 0.000000 0.000000 6
1999382 2012-11-20 19:12:00.00000011 56.80 2012-11-20 19:12:00 0.000000 0.000000 0.000000 0.000000 5
1999395 2009-05-21 15:25:32.0000005 4.50 2009-05-21 15:25:32 0.000000 0.000000 0.000000 0.000000 1
1999486 2011-07-01 18:14:29.0000002 6.50 2011-07-01 18:14:29 0.000000 0.000000 0.000000 0.000000 4
1999489 2009-01-09 18:54:59.0000002 11.10 2009-01-09 18:54:59 0.000000 0.000000 0.000000 0.000000 1
1999538 2015-06-11 21:03:32.0000002 2.50 2015-06-11 21:03:32 0.000000 0.000000 -73.954582 40.784191 2
1999570 2010-05-25 20:27:00.00000019 13.30 2010-05-25 20:27:00 0.000000 0.000000 0.000000 0.000000 1
1999628 2009-07-03 01:17:37.0000001 7.30 2009-07-03 01:17:37 0.000000 0.000000 0.000000 0.000000 1
1999638 2012-06-30 14:43:15.0000001 9.30 2012-06-30 14:43:15 0.000000 0.000000 0.000000 0.000000 1
1999648 2011-08-20 23:17:00.000000211 7.30 2011-08-20 23:17:00 0.000000 0.000000 0.000000 0.000000 1
1999688 2011-08-04 07:17:00.00000075 2.50 2011-08-04 07:17:00 0.000000 0.000000 0.000000 0.000000 1
1999705 2012-08-03 11:54:00.00000042 35.70 2012-08-03 11:54:00 0.000000 0.000000 0.000000 0.000000 1
1999748 2012-07-30 23:55:00.00000073 5.70 2012-07-30 23:55:00 0.000000 0.000000 0.000000 0.000000 5
1999852 2012-09-05 11:58:00.000000171 7.00 2012-09-05 11:58:00 0.000000 0.000000 0.000000 0.000000 2
1999868 2010-11-13 21:40:22.0000001 10.10 2010-11-13 21:40:22 0.000000 0.000000 0.000000 0.000000 3
1999923 2015-03-05 18:19:46.0000006 2.50 2015-03-05 18:19:46 -73.990112 40.746502 0.000000 0.000000 1

41972 rows × 8 columns

Remove 0's from dataset

I'm noticing a lot of rows with at least one 0 in a column (Latitude, Longitude, and/or Passengers), which doesn't make much sense in the scope of this problem, so I'm going to drop all rows with a 0, as junk data, especially since the test dataset didn't contain any records with 0 passengers.

In [14]:
def drop_0s(df, verbose=False):
    if verbose:
        print("Dropping all rows with 0s:")
        old_size = len(df)
        print("Old size: {}".format(old_size))

    df = df.loc[~(df == 0).any(axis=1)]

    if verbose:
        new_size = len(df)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Dropped {} records, or {:.2f}%".format(difference, percent))
    
    return df

train_df = drop_0s(train_df, True)

train_df.describe()
Dropping all rows with 0s:
Old size: 1999831
New size: 1953460
Dropped 46371 records, or 2.32%
Out[14]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 1.953460e+06 1.953460e+06 1.953460e+06 1.953460e+06 1.953460e+06 1.953460e+06
mean 1.134222e+01 -7.391591e+01 4.069553e+01 -7.391419e+01 4.069307e+01 1.690540e+00
std 9.748895e+00 8.068185e+00 5.367408e+00 7.805896e+00 8.471085e+00 1.305509e+00
min 2.500000e+00 -3.377681e+03 -3.458665e+03 -3.383297e+03 -3.461541e+03 1.000000e+00
25% 6.000000e+00 -7.399229e+01 4.073648e+01 -7.399159e+01 4.073544e+01 1.000000e+00
50% 8.500000e+00 -7.398209e+01 4.075330e+01 -7.398061e+01 4.075380e+01 1.000000e+00
75% 1.250000e+01 -7.396830e+01 4.076750e+01 -7.396530e+01 4.076838e+01 2.000000e+00
max 5.000000e+02 2.856442e+03 2.621628e+03 3.414307e+03 3.345917e+03 9.000000e+00
In [15]:
latlon_outliers = drop_0s(latlon_outliers, True)

latlon_outliers.describe()
Dropping all rows with 0s:
Old size: 41972
New size: 2554
Dropped 39418 records, or 93.91%
Out[15]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 2554.000000 2554.000000 2554.000000 2554.000000 2554.000000 2554.00000
mean 14.113344 -28.671107 -1.728532 -28.028814 -3.874352 1.90603
std 21.666389 218.533715 142.267618 210.980402 230.037012 1.53168
min 2.500000 -3377.680935 -3458.664702 -3383.296608 -3461.540872 1.00000
25% 6.100000 -73.982917 -73.975712 -73.981267 -73.972000 1.00000
50% 8.900000 -7.988356 39.606164 -38.976314 39.604093 1.00000
75% 14.100000 40.740424 40.734281 40.740355 40.739597 2.00000
max 495.000000 2856.441560 2621.628430 3414.306675 3345.917353 6.00000

Now it looks like a significant portion of these might have useable data, just with the latitude and/or longitude accidentally swapped. Let's graph it out to see what we're dealing with.

In [16]:
import matplotlib.patches as patches

def plot(df, BB, s=10, alpha=0.2):
    fig, axs = plt.subplots(1, 4, figsize=(20,5))
    
    axs[0].scatter(df.pickup_longitude, df.pickup_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[0].set_xlabel('Latitude')
    axs[0].set_ylabel('Longitude')
    axs[0].set_title('Pickup locations')

    axs[1].scatter(df.dropoff_longitude, df.dropoff_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[1].set_ylabel('Latitude')
    axs[1].set_ylabel('Longitude')
    axs[1].set_title('Dropoff locations')
    
    axs[2].scatter(df.pickup_longitude, df.pickup_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[2].set_xlim((BB[2], BB[3]))
    axs[2].set_xlabel('Latitude')
    axs[2].set_ylim((BB[0], BB[1]))
    axs[2].set_ylabel('Longitude')
    axs[2].set_title('Inverted Pickup locations')

    axs[3].scatter(df.dropoff_longitude, df.dropoff_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[3].set_xlim((BB[2], BB[3]))
    axs[3].set_ylabel('Latitude')
    axs[3].set_ylim((BB[0], BB[1]))
    axs[3].set_ylabel('Longitude')
    axs[3].set_title('Inverted Dropoff locations')
    
plot(latlon_outliers, NYC_BB)
In [17]:
def select_within_boundingbox(df, BB):
    filter_df = df.loc[(df['pickup_longitude'] >= BB[0]) & (df['pickup_longitude'] <= BB[1]) & \
           (df['pickup_latitude'] >= BB[2]) & (df['pickup_latitude'] <= BB[3]) & \
           (df['dropoff_longitude'] >= BB[0]) & (df['dropoff_longitude'] <= BB[1]) & \
           (df['dropoff_latitude'] >= BB[2]) & (df['dropoff_latitude'] <= BB[3])]
    
    return filter_df

inverted_NYC_BB = (40.5, 41.8, -74.5, -72.8)

inverted_outliers = select_within_boundingbox(latlon_outliers, inverted_NYC_BB)

inverted_outliers.describe()
Out[17]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 955.000000 955.000000 955.000000 955.000000 955.000000 955.000000
mean 12.863298 40.747307 -73.974268 40.749219 -73.972069 2.068063
std 10.286091 0.029364 0.039588 0.031698 0.035841 1.676879
min 2.500000 40.641447 -74.017270 40.615782 -74.035197 1.000000
25% 6.750000 40.732793 -73.993080 40.731224 -73.991084 1.000000
50% 9.500000 40.750352 -73.983215 40.751705 -73.979890 1.000000
75% 15.000000 40.764841 -73.968561 40.766713 -73.963399 2.000000
max 60.000000 40.850357 -73.776727 40.886377 -73.746810 6.000000

What to do about these outliers?

The bounding box with inverted coordinates is located in Antartica, so obviously that part of the data is invalid. However, there are 955 records here which have these inverted coordinates, and it's pretty easy to see how this might have happened: the latitude and longitude probably were accidentally swapped. And notice how all the datapoints within the swapped window of lat/lon all seem to be clustered together? It appears that perhaps a cab or a cab company was mistakenly configured for the latitude to record in the longitude column and vice versa.

While 955 records out of 2 million is not even statistically meaningful, it's an easy enough task to convert this data back to a usable state and which likely is accurate to the original source. So I'll end up swapping these values and reincorporating them into the train dataframe.

In [18]:
def swap_inverted(df):
    fixed_df = df.rename(columns={'pickup_longitude' : 'pickup_latitude', 'pickup_latitude' : 'pickup_longitude',
                                                       'dropoff_longitude' : 'dropoff_latitude', 'dropoff_latitude' : 'dropoff_longitude'})
    col_list = fixed_df.columns.tolist()
    
    col_list[3], col_list[4], col_list[5], col_list[6] = col_list[4], col_list[3], col_list[6], col_list[5]
    
    fixed_df = fixed_df[col_list]
    
    return fixed_df

fixed_outliers = swap_inverted(inverted_outliers)

fixed_outliers.describe()
Out[18]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 955.000000 955.000000 955.000000 955.000000 955.000000 955.000000
mean 12.863298 -73.974268 40.747307 -73.972069 40.749219 2.068063
std 10.286091 0.039588 0.029364 0.035841 0.031698 1.676879
min 2.500000 -74.017270 40.641447 -74.035197 40.615782 1.000000
25% 6.750000 -73.993080 40.732793 -73.991084 40.731224 1.000000
50% 9.500000 -73.983215 40.750352 -73.979890 40.751705 1.000000
75% 15.000000 -73.968561 40.764841 -73.963399 40.766713 2.000000
max 60.000000 -73.776727 40.850357 -73.746810 40.886377 6.000000
In [19]:
## Now we'll remove all rows with a datapoint that doesn't fall within the bounding box for NYC coordinates

print("Old size: {}".format(len(train_df)))

train_df = select_within_boundingbox(train_df, NYC_BB)

print("New size: {}".format(len(train_df)))

train_df.describe()
Old size: 1953460
New size: 1950906
Out[19]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 1.950906e+06 1.950906e+06 1.950906e+06 1.950906e+06 1.950906e+06 1.950906e+06
mean 1.133859e+01 -7.397514e+01 4.075106e+01 -7.397426e+01 4.075141e+01 1.690258e+00
std 9.723219e+00 3.859527e-02 2.959620e-02 3.777685e-02 3.276382e-02 1.305164e+00
min 2.500000e+00 -7.448963e+01 4.050005e+01 -7.449105e+01 4.050005e+01 1.000000e+00
25% 6.000000e+00 -7.399229e+01 4.073654e+01 -7.399159e+01 4.073552e+01 1.000000e+00
50% 8.500000e+00 -7.398210e+01 4.075333e+01 -7.398062e+01 4.075383e+01 1.000000e+00
75% 1.250000e+01 -7.396834e+01 4.076752e+01 -7.396536e+01 4.076839e+01 2.000000e+00
max 5.000000e+02 -7.281783e+01 4.169685e+01 -7.281783e+01 4.171463e+01 9.000000e+00
In [20]:
train_df_copy = train_df # Created a copy so as to avoid the possibility of adding the fixed outliers multiple times

train_df = pd.concat([train_df_copy, fixed_outliers], ignore_index=True)

train_df_copy = None  # Doing this to try to be a bit more memory efficient

train_df.describe()
Out[20]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06
mean 1.133934e+01 -7.397514e+01 4.075106e+01 -7.397426e+01 4.075141e+01 1.690443e+00
std 9.723558e+00 3.859576e-02 2.959620e-02 3.777595e-02 3.276334e-02 1.305398e+00
min 2.500000e+00 -7.448963e+01 4.050005e+01 -7.449105e+01 4.050005e+01 1.000000e+00
25% 6.000000e+00 -7.399229e+01 4.073654e+01 -7.399159e+01 4.073552e+01 1.000000e+00
50% 8.500000e+00 -7.398210e+01 4.075333e+01 -7.398062e+01 4.075383e+01 1.000000e+00
75% 1.250000e+01 -7.396834e+01 4.076752e+01 -7.396536e+01 4.076839e+01 2.000000e+00
max 5.000000e+02 -7.281783e+01 4.169685e+01 -7.281783e+01 4.171463e+01 9.000000e+00

LatLon Outlier Fixing Function

Now that the investigation is over and a plan of action has been decided, I'm turning all of the transformation steps into a function, which will later be used on the whole dataset.

In [21]:
def latlon_outlier_fix(df, BB, verbose=False):
    ## This also fixes 0s in columns like passenger_count
    df = drop_0s(df, verbose)
    
    outliers = select_outside_boundingbox(df, BB)
    
    if verbose:
        print("Dataset outliers: {}".format(len(outliers)))
    
    inverted_BB = (BB[2], BB[3], BB[0], BB[1])

    inverted_outliers = select_within_boundingbox(outliers, inverted_BB)
    
    if verbose:
        print("Inverted outliers: {}".format(len(inverted_outliers)))
        
    fixed_outliers = swap_inverted(inverted_outliers)
    
    if verbose:
        print("")
        old_size = len(df)
        print("Old size: {}".format(old_size))

    df = select_within_boundingbox(df, BB)

    if verbose:
        new_size = len(df)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Dropped {} records, or {:.2f}%".format(difference, percent))

    df = pd.concat([df, fixed_outliers])
    
    if verbose:
        newer_size = len(df)
        print("Final size: {}".format(newer_size))
        difference = newer_size - new_size
        percent = (difference / new_size) * 100
        print("Fixed {} records, or {:.2f}%".format(difference, percent))
    
    return df

Turn the dates into a usable format for the model

Since the model would just view the dates as a string, and at best just one-hot encode each second represented into a bucket (or at worst, since that would be next to impossible to compute on such a large dataset), let's bucketize each datetime into all its components ranging from hour of the day to the year. For this initial model, I'll leave these categorical values as-is, but later I'll perform feature selection and run feature crosses to distil this into only the most useful data.

In [22]:
## Before we begin, let's measure the memory impact that adding new columns might have

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
        
    else: # We assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
        
    usage_mb = usage_b / 1024 ** 2 # Convert bytes to megabytes
    
    return "{:03.2f} MB".format(usage_mb)

print("Total Mem Usage: {}".format(mem_usage(train_df)))
Total Mem Usage: 261.87 MB
In [23]:
def prepare_time_features(df, verbose=False, drop=False):
    df["hour"] = df.pickup_datetime.dt.hour
    df["day_of_week"] = df.pickup_datetime.dt.weekday
    df["day_of_month"] = df.pickup_datetime.dt.day
    df["week"] = df.pickup_datetime.dt.week
    df["month"] = df.pickup_datetime.dt.month
    df["year"] = df.pickup_datetime.dt.year - 2000  # Reducing to 2 digits for less memory usage

    if drop:
        df.drop(columns=['pickup_datetime'], inplace=True)
    
    if verbose:
        print("Total Mem Usage: {}".format(mem_usage(train_df)))
        display(df.head())
        
    return df

train_df = prepare_time_features(train_df, True, True)

test_df = prepare_time_features(test_df, False, True)

train_df.describe()
Total Mem Usage: 336.32 MB
key fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year
0 2009-06-15 17:26:21.0000001 4.5 -73.844311 40.721319 -73.841610 40.712278 1 17 0 15 25 6 9
1 2010-01-05 16:52:16.0000002 16.9 -74.016048 40.711303 -73.979268 40.782004 1 16 1 5 1 1 10
2 2011-08-18 00:35:00.00000049 5.7 -73.982738 40.761270 -73.991242 40.750562 2 0 3 18 33 8 11
3 2012-04-21 04:30:42.0000001 7.7 -73.987130 40.733143 -73.991567 40.758092 1 4 5 21 16 4 12
4 2010-03-09 07:51:00.000000135 5.3 -73.968095 40.768008 -73.956655 40.783762 1 7 1 9 10 3 10
Out[23]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year
count 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06 1.951861e+06
mean 1.133934e+01 -7.397514e+01 4.075106e+01 -7.397426e+01 4.075141e+01 1.690443e+00 1.350771e+01 3.041674e+00 1.570912e+01 2.546962e+01 6.270555e+00 1.173895e+01
std 9.723558e+00 3.859576e-02 2.959620e-02 3.777595e-02 3.276334e-02 1.305398e+00 6.515213e+00 1.949858e+00 8.681286e+00 1.494934e+01 3.436912e+00 1.865953e+00
min 2.500000e+00 -7.448963e+01 4.050005e+01 -7.449105e+01 4.050005e+01 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 9.000000e+00
25% 6.000000e+00 -7.399229e+01 4.073654e+01 -7.399159e+01 4.073552e+01 1.000000e+00 9.000000e+00 1.000000e+00 8.000000e+00 1.300000e+01 3.000000e+00 1.000000e+01
50% 8.500000e+00 -7.398210e+01 4.075333e+01 -7.398062e+01 4.075383e+01 1.000000e+00 1.400000e+01 3.000000e+00 1.600000e+01 2.400000e+01 6.000000e+00 1.200000e+01
75% 1.250000e+01 -7.396834e+01 4.076752e+01 -7.396536e+01 4.076839e+01 2.000000e+00 1.900000e+01 5.000000e+00 2.300000e+01 3.900000e+01 9.000000e+00 1.300000e+01
max 5.000000e+02 -7.281783e+01 4.169685e+01 -7.281783e+01 4.171463e+01 9.000000e+00 2.300000e+01 6.000000e+00 3.100000e+01 5.300000e+01 1.200000e+01 1.500000e+01
In [24]:
train_df.dtypes
Out[24]:
key                   object
fare_amount          float64
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
hour                   int64
day_of_week            int64
day_of_month           int64
week                   int64
month                  int64
year                   int64
dtype: object
In [25]:
print("Total Mem Usage: {} \n".format(mem_usage(train_df)))

df_int = train_df.select_dtypes(include=['int64', 'int32', 'int16', 'int8', 'int'])
converted_int = df_int.apply(pd.to_numeric,downcast='unsigned')

print("Int Mem Usage: {}".format(mem_usage(df_int)))
print("New Int Mem Usage: {}".format(mem_usage(converted_int)))

compare_ints = pd.concat([df_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
Total Mem Usage: 336.32 MB
Int Mem Usage: 104.24 MB
New Int Mem Usage: 13.03 MB
Out[25]:
before after
uint8 NaN 7.0
int64 7.0 NaN
In [26]:
train_df[df_int.columns] = df_int.apply(pd.to_numeric, downcast='unsigned')

train_df.dtypes
Out[26]:
key                   object
fare_amount          float64
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        uint8
hour                   uint8
day_of_week            uint8
day_of_month           uint8
week                   uint8
month                  uint8
year                   uint8
dtype: object
In [27]:
df_float = train_df.select_dtypes(include=['float64', 'float32', 'float16', 'float'])
converted_float = df_float.apply(pd.to_numeric,downcast='float')

print("Float Mem Usage: {}".format(mem_usage(df_float)))
print("New Float Mem Usage: {}".format(mem_usage(converted_float)))

compare_floats = pd.concat([df_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
Float Mem Usage: 74.46 MB
New Float Mem Usage: 37.23 MB
Out[27]:
before after
float32 NaN 5.0
float64 5.0 NaN
In [28]:
train_df[df_float.columns] = df_float.apply(pd.to_numeric,downcast='float')

print(train_df.dtypes)

print("Total Mem Usage: {}".format(mem_usage(train_df)))
key                   object
fare_amount          float32
pickup_longitude     float32
pickup_latitude      float32
dropoff_longitude    float32
dropoff_latitude     float32
passenger_count        uint8
hour                   uint8
day_of_week            uint8
day_of_month           uint8
week                   uint8
month                  uint8
year                   uint8
dtype: object
Total Mem Usage: 207.88 MB

Numeric downcast function

With noticable memory reduction, it'll be useful to have this packaged into a function that can be used on the whole dataset.

In [29]:
def downcast(df, verbose=False):
    if verbose:
        print("Original Mem Usage: {}".format(mem_usage(df)))
    
    df_int = df.select_dtypes(include=['int64', 'int32', 'int16', 'int8', 'int'])
    df[df_int.columns] = df_int.apply(pd.to_numeric,downcast='unsigned')
    
    df_float = df.select_dtypes(include=['float64', 'float32', 'float16', 'float'])
    df[df_float.columns] = df_float.apply(pd.to_numeric,downcast='float')
        
    if verbose:
        print("Optimized Mem Usage: {}".format(mem_usage(df)))
    
    return df

Location data

Sice we're dealing with location data, lets plot the coordinates on a map to get a better view of the data.

For this, Albert used the following websites:

Albert also defined a bounding box of interest by [long_min, long_max, latt_min, latt_max] using the minimum and maximum coordinates from the testset.

From Open Street Map Albert grabbed a map and dropped any datapoints outside that box.

In [30]:
nyc_map = plt.imread('https://aiblog.nl/download/nyc_-74.5_-72.8_40.5_41.8.png')

## load extra image to zoom in on NYC
NYC_BB_zoom = (-74.3, -73.7, 40.5, 40.9)
nyc_map_zoom = plt.imread('https://aiblog.nl/download/nyc_-74.3_-73.7_40.5_40.9.png')
In [31]:
## This function will be used more often to plot data on the NYC map
def plot_on_map(df, BB, nyc_map, s=10, alpha=0.2):
    fig, axs = plt.subplots(1, 2, figsize=(16,10))
    
    axs[0].scatter(df.pickup_longitude, df.pickup_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[0].set_xlim((BB[0], BB[1]))
    axs[0].set_ylim((BB[2], BB[3]))
    axs[0].set_title('Pickup locations')
    axs[0].imshow(nyc_map, zorder=0, extent=BB)

    axs[1].scatter(df.dropoff_longitude, df.dropoff_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[1].set_xlim((BB[0], BB[1]))
    axs[1].set_ylim((BB[2], BB[3]))
    axs[1].set_title('Dropoff locations')
    axs[1].imshow(nyc_map, zorder=0, extent=BB)
In [32]:
## Plot training data on map
plot_on_map(train_df, NYC_BB, nyc_map, s=1, alpha=0.3)
In [33]:
## Plot training data on map zoomed in
plot_on_map(train_df, NYC_BB_zoom, nyc_map_zoom, s=1, alpha=0.3)
In [34]:
## Plot test data on map
plot_on_map(test_df, NYC_BB, nyc_map, alpha=1.0, s=20)
In [35]:
def plot_hires(df, BB, figsize=(12, 12), ax=None, c=('r', 'b')):
    if ax == None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)

    idx = select_within_boundingbox(df, BB)
    ax.scatter(idx.pickup_longitude, idx.pickup_latitude, c=c[0], s=0.01, alpha=0.5)
    ax.scatter(idx.dropoff_longitude, idx.dropoff_latitude, c=c[1], s=0.01, alpha=0.5)

An other interesting way to visualize the data Albert learned from this kernel: https://www.kaggle.com/drgilermo/dynamics-of-new-york-city-animation. By using very small dot sizes the actual streets of New York become visible.

In [36]:
plot_hires(train_df, (-74.1, -73.7, 40.6, 40.9))
plot_hires(train_df, (-74, -73.95, 40.7, 40.8))
In [37]:
## Plot histogram of fare
train_df[train_df.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title('Histogram');

The small spikes between \$40 and \$60 could indicate some fixed fare price (e.g. to/from airport). This will be explored further below.

Removing datapoints in water

These datapoints are clearly erraneous and I can't think of a way to fix them and get useful info from them, so we'll work to remove them. For this Albert used Photoshop to threshold on the blue color of the water and to cleanup the map. The resulting map is show below.

In [38]:
## Read nyc mask and turn into boolean map with land = True, water = False
nyc_mask = plt.imread('https://aiblog.nl/download/nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] > 0.9

plt.figure(figsize=(8,8))
plt.imshow(nyc_map, zorder=0)
plt.imshow(nyc_mask, zorder=1, alpha=0.7); # note: True is show in black, False in white.

Next, we convert longitude/latitude coordinates to xy pixel coordinates using lonlat_to_xy. Note that the y coordinate is reversed as the image y-axis goes from top to bottom. Then a boolean index is calculated using nyc_mask and we use a function to remove the offending datapoints.

In [39]:
def remove_datapoints_from_water(df, BB, nyc_mask, verbose=False):
    def lonlat_to_xy(longitude, latitude, dx, dy, BB):
        return (dx*(longitude - BB[0])/(BB[1]-BB[0])).astype('int'), \
               (dy - dy*(latitude - BB[2])/(BB[3]-BB[2])).astype('int')

    if verbose:
        print("Removing all rows with a datapoint in the water:")
        old_size = len(df)
        print("Old size: {}".format(old_size))
    
    # calculate for each lon,lat coordinate the xy coordinate in the mask map
    pickup_x, pickup_y = lonlat_to_xy(df.pickup_longitude, df.pickup_latitude, 
                                      nyc_mask.shape[1], nyc_mask.shape[0], BB)
    dropoff_x, dropoff_y = lonlat_to_xy(df.dropoff_longitude, df.dropoff_latitude, 
                                      nyc_mask.shape[1], nyc_mask.shape[0], BB)    
    # calculate boolean index
    idx = nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x]
    
    df = df.loc[idx]

    if verbose:
        print("Number of trips in water: {}".format(np.sum(~idx)))
        new_size = len(df)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Dropped {} records, or {:.2f}%".format(difference, percent))
    
    # return only datapoints on land
    return df
In [40]:
train_df = remove_datapoints_from_water(train_df, NYC_BB, nyc_mask, verbose=True)
Removing all rows with a datapoint in the water:
Old size: 1951861
Number of trips in water: 388
New size: 1951473
Dropped 388 records, or 0.02%

Now let's see if all outliers in the water are still there.

In [41]:
# plot training data
plot_on_map(train_df, NYC_BB, nyc_map)

Datapoint density per sq km

The above scatterplot gives a quick impression of the density, but it's more accurate to use actual counts of datapoints. Lets plot out the hotspots:

In [42]:
def distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    ## Convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    ## Haversine formula 
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

def apply_distance(df, verbose=False):
    df['distance_km'] = distance(df.pickup_latitude, df.pickup_longitude,
                                df.dropoff_latitude, df.dropoff_longitude)
    if verbose:
        display(df.distance_km.describe())

    return df

## First calculate two arrays with datapoint density per sq km
n_lon, n_lat = 200, 200 # number of grid bins per longitude, latitude dimension
density_pickup, density_dropoff = np.zeros((n_lat, n_lon)), np.zeros((n_lat, n_lon)) # Prepare arrays

"""
To calculate the number of datapoints in a grid area, the numpy.digitize() function is used. 
This function needs an array with the (location) bins for counting the number of datapoints per bin.
"""
bins_lon = np.zeros(n_lon+1) # bin
bins_lat = np.zeros(n_lat+1) # bin

delta_lon = (NYC_BB[1]-NYC_BB[0]) / n_lon # bin longutide width
delta_lat = (NYC_BB[3]-NYC_BB[2]) / n_lat # bin latitude height

bin_width_kms = distance(NYC_BB[2], NYC_BB[1], NYC_BB[2], NYC_BB[0]) / n_lon # Bin width in kms
bin_height_kms = distance(NYC_BB[3], NYC_BB[0], NYC_BB[2], NYC_BB[0]) / n_lat # Bin height in kms

for i in range(n_lon+1):
    bins_lon[i] = NYC_BB[0] + i * delta_lon

for j in range(n_lat+1):
    bins_lat[j] = NYC_BB[2] + j * delta_lat
    
## Digitize per longitude, latitude dimension
inds_pickup_lon = np.digitize(train_df.pickup_longitude, bins_lon)
inds_pickup_lat = np.digitize(train_df.pickup_latitude, bins_lat)
inds_dropoff_lon = np.digitize(train_df.dropoff_longitude, bins_lon)
inds_dropoff_lat = np.digitize(train_df.dropoff_latitude, bins_lat)

"""
Count per grid bin
note: as the density_pickup will be displayed as image, the first index is the y-direction, 
      the second index is the x-direction. Also, the y-direction needs to be reversed for
      properly displaying (therefore the (n_lat-j) term)
"""
dxdy = bin_width_kms * bin_height_kms

for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j, i] = np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j))) / dxdy
        density_dropoff[j, i] = np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j))) / dxdy
In [43]:
## Plot the density arrays
fig, axs = plt.subplots(2, 1, figsize=(18, 24))

axs[0].imshow(nyc_map, zorder=0, extent=NYC_BB);
im = axs[0].imshow(np.log1p(density_pickup), zorder=1, extent=NYC_BB, alpha=0.6, cmap='plasma')
axs[0].set_title('Pickup density [datapoints per sq km]')
cbar = fig.colorbar(im, ax=axs[0])
cbar.set_label('log(1 + # datapoints per sq km)', rotation=270)

axs[1].imshow(nyc_map, zorder=0, extent=NYC_BB);
im = axs[1].imshow(np.log1p(density_dropoff), zorder=1, extent=NYC_BB, alpha=0.6, cmap='plasma')
axs[1].set_title('Dropoff density [datapoints per sq km]')
cbar = fig.colorbar(im, ax=axs[1])
cbar.set_label('log(1 + # datapoints per sq km)', rotation=270)

The datapoints concentrate around Manhattan and the three airports (JFK, EWS, LGR), along with a hotspot near Seymour (upper right corner).

Longer distance, higher fare

First, lets calculate the straight-line distance between points.

In [44]:
## Add new column to dataframe with distance in kms
train_df = apply_distance(train_df, True)

train_df.distance_km.hist(bins=50, figsize=(12,4))
plt.xlabel('distance km')
plt.title('Histogram ride distances in km')
C:\Users\Watki\Anaconda3\envs\MLND\lib\site-packages\ipykernel\__main__.py:13: RuntimeWarning: invalid value encountered in sqrt
count    1.951473e+06
mean     3.330480e+00
std      3.778429e+00
min      0.000000e+00
25%      1.253831e+00
50%      2.152846e+00
75%      3.916280e+00
max      1.147885e+02
Name: distance_km, dtype: float64
Out[44]:
Text(0.5,1,'Histogram ride distances in km')

It seems that most rides are just short rides, with a small peak at ~20 kms. This peak could be due to airport drives.

Let's also see if passenger_count changes anything.

In [45]:
train_df.groupby('passenger_count')['distance_km', 'fare_amount'].mean()
Out[45]:
distance_km fare_amount
passenger_count
1 3.283213 11.203874
2 3.507306 11.800907
3 3.380833 11.520043
4 3.430901 11.721943
5 3.335211 11.218318
6 3.415208 12.168925
9 13.046386 104.000000

Baseline

If the model can't beat a simple calculation of applying the average dollars per km, then the model won't be able to justify its computational cost. So, I'll use the average of \$3.40 USD/km (with anything under a total of $2.50 rounded up) as my baseline.

In [46]:
print("Average $USD/km : {:0.2f}".format(train_df.fare_amount.sum()/train_df.distance_km.sum()))
Average $USD/km : 3.40

Plot Fare and Distance

To confirm that more than just the distance goes into a taxi's fare, let's plot the two and if we don't get a straight line, we'll know that there's more that goes into it.

In [47]:
## Scatter plot distance - fare
fig, axs = plt.subplots(1, 2, figsize=(16,6))
axs[0].scatter(train_df.distance_km, train_df.fare_amount, alpha=0.2)
axs[0].set_xlabel('distance km')
axs[0].set_ylabel('fare $USD')
axs[0].set_title('All data')

## Zoom in on part of data
idx = (train_df.distance_km < 30) & (train_df.fare_amount < 100)
axs[1].scatter(train_df[idx].distance_km, train_df[idx].fare_amount, alpha=0.2)
axs[1].set_xlabel('distance km')
axs[1].set_ylabel('fare $USD')
axs[1].set_title('Zoom in on distance < 30 km, fare < $100');

Observations:

  • Many trips are listed at a 0 distance. There might be some common factor that helps explain these trips. Let's look into that next.
  • Certain fares seem to have high numbers across a wide range of distances (notice the horizontal lines). We will have to investigate why that is.
  • Generally, there is a linear correlation between distance and fare, but that grouping is very broad, so there must be many more factors that go into determining the fare.

Knowing all these factors we could theoretically hard code our own decision tree and try to recreate the specific algorithm that is inside each taxi meter; then we'd only have to predict the route the driver would choose to take and the travel time for that time of day. However, I believe that misses the point of machine learning and this competition, so I'm going to try to train my algorithm to pick up on all of these subtleties without having to spell everything out.

In [48]:
def split_distance(df, split_point=.1, verbose=False):
    if verbose:
        print("Removing all distances less than {} km:".format(split_point))
        old_size = len(df)
        print("Old size: {}".format(old_size))

    long = df.loc[df['distance_km'] >= split_point]
    short = df.loc[df['distance_km'] < split_point]

    if verbose:
        new_size = len(long)
        print("New size: {}".format(new_size))
        difference = old_size - new_size
        percent = (difference / old_size) * 100
        print("Separated out {} records, or {:.2f}%".format(difference, percent))
    
    return short, long
In [49]:
short_train, train_df = split_distance(train_df, verbose=True)

train_df.describe()
Removing all distances less than 0.1 km:
Old size: 1951473
New size: 1919185
Separated out 32288 records, or 1.65%
Out[49]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km
count 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06 1.919185e+06
mean 1.131462e+01 -7.397559e+01 4.075101e+01 -7.397466e+01 4.075136e+01 1.691338e+00 1.351460e+01 3.041690e+00 1.570861e+01 2.546898e+01 6.270548e+00 1.174591e+01 3.386330e+00
std 9.481738e+00 3.551736e-02 2.766991e-02 3.460738e-02 3.108841e-02 1.305720e+00 6.514085e+00 1.949736e+00 8.681661e+00 1.495197e+01 3.437570e+00 1.866683e+00 3.785257e+00
min 2.500000e+00 -7.448963e+01 4.050386e+01 -7.449105e+01 4.050465e+01 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 9.000000e+00 1.000114e-01
25% 6.000000e+00 -7.399230e+01 4.073660e+01 -7.399160e+01 4.073559e+01 1.000000e+00 9.000000e+00 1.000000e+00 8.000000e+00 1.300000e+01 3.000000e+00 1.000000e+01 1.292005e+00
50% 8.500000e+00 -7.398213e+01 4.075336e+01 -7.398066e+01 4.075386e+01 1.000000e+00 1.400000e+01 3.000000e+00 1.600000e+01 2.400000e+01 6.000000e+00 1.200000e+01 2.190111e+00
75% 1.250000e+01 -7.396855e+01 4.076752e+01 -7.396561e+01 4.076840e+01 2.000000e+00 1.900000e+01 5.000000e+00 2.300000e+01 3.900000e+01 9.000000e+00 1.300000e+01 3.964554e+00
max 4.880000e+02 -7.285966e+01 4.169685e+01 -7.284809e+01 4.171463e+01 9.000000e+00 2.300000e+01 6.000000e+00 3.100000e+01 5.300000e+01 1.200000e+01 1.500000e+01 1.147885e+02
In [50]:
short_train.describe()
Out[50]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km
count 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000 32288.000000
mean 12.795007 -73.949486 40.755535 -73.949486 40.755535 1.635623 13.099325 3.040634 15.739996 25.498328 6.268645 11.340188 0.010712
std 19.183846 0.117820 0.084796 0.117821 0.084794 1.283686 6.573326 1.957509 8.665473 14.799633 3.399262 1.778374 0.023012
min 2.500000 -74.481644 40.500046 -74.481628 40.500046 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 9.000000 0.000000
25% 4.000000 -73.991112 40.732295 -73.991135 40.732307 1.000000 8.000000 1.000000 8.000000 13.000000 3.000000 10.000000 0.000000
50% 6.500000 -73.976761 40.752159 -73.976799 40.752155 1.000000 14.000000 3.000000 16.000000 25.000000 6.000000 11.000000 0.000000
75% 11.300000 -73.949482 40.767302 -73.949493 40.767285 2.000000 19.000000 5.000000 23.000000 38.000000 9.000000 13.000000 0.004459
max 500.000000 -72.817833 41.513023 -72.817833 41.513023 6.000000 23.000000 6.000000 31.000000 53.000000 12.000000 15.000000 0.099989
In [51]:
short_train.loc[short_train['distance_km'] == 0.0].head()
Out[51]:
key fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km
102 2009-03-25 00:08:52.0000001 52.0 -74.035835 40.747318 -74.035835 40.747318 1 0 2 25 13 3 9 0.0
187 2014-01-08 21:55:58.0000006 6.5 -73.998489 40.726303 -73.998489 40.726303 1 21 2 8 2 1 14 0.0
264 2012-08-25 01:53:42.0000005 7.5 -73.995895 40.746452 -73.995895 40.746452 1 1 5 25 34 8 12 0.0
283 2009-12-14 12:33:00.00000075 6.9 -73.982430 40.745747 -73.982430 40.745747 1 12 0 14 51 12 9 0.0
385 2014-03-12 18:12:44.0000006 12.0 -73.844902 40.736317 -73.844902 40.736317 1 18 2 12 11 3 14 0.0
In [52]:
test_df = apply_distance(test_df, True)

short_test, long_test = split_distance(test_df, verbose=True)
count    9914.000000
mean        3.435371
std         3.972374
min         0.000000
25%         1.298277
50%         2.217412
75%         4.045302
max        99.996040
Name: distance_km, dtype: float64
Removing all distances less than 0.1 km:
Old size: 9914
New size: 9790
Separated out 124 records, or 1.25%
In [53]:
short_test.describe()
Out[53]:
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km
count 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000 124.000000
mean -73.939261 40.770322 -73.939245 40.770322 1.741935 12.822581 2.822581 16.983871 29.419355 7.145161 11.427419 0.012906
std 0.153912 0.113946 0.153908 0.113946 1.354733 6.629631 2.052357 8.745716 13.262881 3.091277 1.711622 0.024845
min -74.047394 40.622553 -74.047394 40.622553 1.000000 0.000000 0.000000 1.000000 3.000000 1.000000 9.000000 0.000000
25% -73.988079 40.739215 -73.988135 40.739215 1.000000 7.750000 1.000000 10.000000 20.000000 5.000000 10.000000 0.000000
50% -73.976690 40.752883 -73.976690 40.752883 1.000000 13.000000 3.000000 19.000000 29.500000 7.000000 11.000000 0.000000
75% -73.955018 40.769847 -73.955018 40.770109 2.000000 18.000000 5.000000 24.000000 40.000000 10.000000 13.000000 0.012726
max -73.137393 41.366138 -73.137393 41.366138 6.000000 23.000000 6.000000 31.000000 51.000000 12.000000 15.000000 0.097406
In [54]:
test_df.loc[test_df['distance_km'] == 0.0].head()
Out[54]:
key pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km
121 2014-06-14 13:39:00.000000191 -73.980590 40.747682 -73.980590 40.747682 1 13 5 14 24 6 14 0.0
279 2010-09-05 22:31:32.0000002 -74.047394 40.785789 -74.047394 40.785789 1 22 6 5 35 9 10 0.0
318 2009-06-10 16:55:00.000000131 -73.985862 40.744027 -73.985862 40.744027 1 16 2 10 24 6 9 0.0
417 2011-06-24 12:03:00.000000145 -73.964615 40.777620 -73.964615 40.777620 1 12 4 24 25 6 11 0.0
423 2011-06-24 12:03:00.00000086 -73.955065 40.771505 -73.955065 40.771505 1 12 4 24 25 6 11 0.0

It's hard to imagine records for distances less than 0.1 km (328 feet or 100 meters; which is less than the length of most blocks) as being accurate data except in very rare edge cases. But only about 1% of the datapoints in the train and test set are affected by this discrepancy. They seem to happen at a relatively even distribution across a wide range of fares, locations, times and dates.

And since there seems to be no aparent rhyme or reason to these unreasonably short distances, their existence would probably only confuse the model, which is why I've removed them. Even though the test set also contains unreasonably short distances, I don't think there's a good way to accurately learn from the ones in the train set, and so the model will have to just guess at those based on other related features. With the RMSE metric, I don't think this 1% of datapoints will throw off the accuracy too drastically.

While we're questioning the validity of the test data, lets check it for datapoints in the water.

In [55]:
remove_datapoints_from_water(test_df, NYC_BB, nyc_mask, verbose=True)
len(test_df)
Removing all rows with a datapoint in the water:
Old size: 9914
Number of trips in water: 2
New size: 9912
Dropped 2 records, or 0.02%
Out[55]:
9914

And sure enough, there are a couple of datapoints in the water, but that's not too bad.

Fixed fee airport trips

It's not uncommon for taxis to charge fixed rates for trips to/from an airport. Let's check and see if that's a factor in this dataset.

In [56]:
## JFK airport coordinates, see https://www.travelmath.com/airport/JFK
jfk = (-73.7822222222, 40.6441666667)
nyc = (-74.0063889, 40.7141667)

def plot_location_fare(loc, name, range=1.5):
    ## Select all datapoints with dropoff location within range of airport
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    
    idx = (distance(train_df.pickup_latitude, train_df.pickup_longitude, loc[1], loc[0]) < range)
    train_df[idx].fare_amount.hist(bins=100, ax=axs[0])
    axs[0].set_xlabel('fare $USD')
    axs[0].set_title('Histogram pickup location within {} kms of {}'.format(range, name))

    idx = (distance(train_df.dropoff_latitude, train_df.dropoff_longitude, loc[1], loc[0]) < range)
    train_df[idx].fare_amount.hist(bins=100, ax=axs[1])
    axs[1].set_xlabel('fare $USD')
    axs[1].set_title('Histogram dropoff location within {} kms of {}'.format(range, name));
    
plot_location_fare(jfk, 'JFK Airport')

It definitely looks like trips to and from an airport run at a fixed total rate.

Let's also check the other two big airports.

In [57]:
ewr = (-74.175, 40.69) # Newark Liberty International Airport, see https://www.travelmath.com/airport/EWR
lgr = (-73.87, 40.77) # LaGuardia Airport, see https://www.travelmath.com/airport/LGA

plot_location_fare(ewr, 'Newark Airport')
plot_location_fare(lgr, 'LaGuardia Airport')

While I could certainly anticipate these special cases, and create a complex filter to catch all the fixed fare airport rides, then train with extra columns. This would be not too dissimilar to skipping the model completely for these datapoints and hard coding in some rules for any rides to or from a certain distance from an airport. However, that neglects the whole point of machine learning, which is to avoid hard coding in complex rules. Furthermore, I'm using feature crosses on the lat/lon data so the model will likely be able to notice these patterns on its own and hopefully predict the flat rates without prompting.

I think the preparation of the data I've done so far has been sufficient to set this model up for success, so now I just need to apply it to all the data and feed it to the model.

Create a hashing function to separate the train data into train and validate

In [58]:
def split_by_hash(df, train_percent=80):
    """
    This function splits the dataset in a repeatable manner using hashes of the key column.
    
    train_percent (int) is the percent of the dataset that gets divided into the train file, with the remaining percent going into the validate file.
    example: train_percent=75 would result in roughly 25% of the data going into the validate file
    """
    df["hashed_result"] = df.key.apply(hash) % 100

    train = df[df.hashed_result < train_percent]
    validate = df[df.hashed_result >= train_percent]
    
    train.drop(columns='hashed_result', inplace=True)
    validate.drop(columns='hashed_result', inplace=True)
    
    return train, validate

Prepare train and validation files on the full dataset

The below function takes around 5 hours to run in full.

Note: I was able to run this smoothly until chunk 28, which ran into an IndexError in the remove_datapoints_from_water function, so I saved a copy of the first 27 chunks as a smaller dataset (if 27 million rows is considered small) and tried again. It encountered the error again on the same index 1282, but I couldn't find anything out of the ordinary. So in the interest of time, I try/except skipped that error, because it should hurt things too badly if a few datapoints in the water get missed. I also added the if count check so that I could just start appending to the previous file from where it left off.

In [61]:
def prepare_train_valid(file, BB, verbose=False, chunksize=1000000, train_percent=80):
    """
    NOTE: If rerunning this, make sure to delete the previous train/valid files first.
    """
    
    count = 0
    train_file = "taxi-train.csv"
    validation_file = "taxi-valid.csv"
    for df in pd.read_csv(file, chunksize=chunksize, parse_dates=["pickup_datetime"]):
        count += 1
        print("Processing chunk: {} \r".format(count),)
        
        if count > 27:
            df = fix_fares(df, verbose)
            df = drop_nan(df, verbose)
            df = drop_0s(df, verbose)
            df = latlon_outlier_fix(df, BB, verbose)
            df = prepare_time_features(df, verbose, drop=True)
            
            try:
                df = remove_datapoints_from_water(df, NYC_BB, nyc_mask, verbose)
            except IndexError:
                print("Ignored Index Error")
            
            df = apply_distance(df, verbose)
            short_train, df = split_distance(df, verbose)
            
    #         df = downcast(df, verbose) # Not using due to possible issues with dtypes in TF
    
            train_df, validate_df = split_by_hash(df, train_percent)

            train_df.to_csv(train_file, index=False, header=False, mode='a')
            validate_df.to_csv(validation_file, index=False, header=False, mode='a')
In [62]:
prepare_train_valid("./datafiles/train.csv", NYC_BB)
Processing chunk: 1 
Processing chunk: 2 
Processing chunk: 3 
Processing chunk: 4 
Processing chunk: 5 
Processing chunk: 6 
Processing chunk: 7 
Processing chunk: 8 
Processing chunk: 9 
Processing chunk: 10 
Processing chunk: 11 
Processing chunk: 12 
Processing chunk: 13 
Processing chunk: 14 
Processing chunk: 15 
Processing chunk: 16 
Processing chunk: 17 
Processing chunk: 18 
Processing chunk: 19 
Processing chunk: 20 
Processing chunk: 21 
Processing chunk: 22 
Processing chunk: 23 
Processing chunk: 24 
Processing chunk: 25 
Processing chunk: 26 
Processing chunk: 27 
Processing chunk: 28 
C:\Users\Watki\Anaconda3\envs\MLND\lib\site-packages\pandas\core\frame.py:3694: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
Processing chunk: 29 
Processing chunk: 30 
Processing chunk: 31 
Processing chunk: 32 
Processing chunk: 33 
Processing chunk: 34 
Processing chunk: 35 
Processing chunk: 36 
Processing chunk: 37 
Processing chunk: 38 
Processing chunk: 39 
Processing chunk: 40 
Processing chunk: 41 
Processing chunk: 42 
Processing chunk: 43 
Processing chunk: 44 
Processing chunk: 45 
Processing chunk: 46 
Processing chunk: 47 
Processing chunk: 48 
Processing chunk: 49 
Processing chunk: 50 
Processing chunk: 51 
Processing chunk: 52 
Processing chunk: 53 
Processing chunk: 54 
Processing chunk: 55 
Processing chunk: 56 

Prepare test file

...with compatible formatting if needed, but all of the non-destructive formatting was already done above during data exploration.

In [59]:
# test_df = downcast(test_df, True)  # Turned off due to dtype errors in TensorFlow

test_df.to_csv("taxi-test.csv", index=False)
In [60]:
keyless_test_df = test_df.drop(columns="key")

keyless_test_df.to_csv("keyless-taxi-test.csv", index=False)

Convert test dataset to a json format

This is for doing batch prediction on Google Cloud ML Engine.

Unfortunately, the workaround that I tried to pass a key through TensorFlow out to the prediction batch, didn't work and I got an Unexpected Tensor Error. So I had to create a separate version of the batch prediction input file without the key column, then later join the batch prediction output with the keys from the test file to create the final submission file.

In [61]:
test_df.to_json("taxi_test_unformatted.json", orient = "records", date_format = "epoch",
                double_precision = 10, force_ascii = True, date_unit = "ms", default_handler = None)

keyless_test_df.to_json("keyless_taxi_test_unformatted.json", orient = "records", date_format = "epoch",
                double_precision = 10, force_ascii = True, date_unit = "ms", default_handler = None)
In [62]:
import json

## Transform json file including keys
json_file = open('taxi_test_unformatted.json', 'r')
json_file = json.load(json_file)
new_json = open('taxi_test.json', 'a+')

for row in json_file:
    new_row = json.dumps(row)
    new_json.write(new_row + "\n")
    
new_json.close()

## Transform json file excluding keys
json_file = open('keyless_taxi_test_unformatted.json', 'r')
json_file = json.load(json_file)
new_json = open('keyless_taxi_test.json', 'a+')

for row in json_file:
    new_row = json.dumps(row)
    new_json.write(new_row + "\n")
    
new_json.close()
In [63]:
prediction_df = pd.DataFrame()

## Create a new DataFrame with the predictions,
## starting from file #2 since that contains the 1st half of rows.
prediction_json = open("./datafiles/taxifare_prediction2.json", 'r')
for row in prediction_json:
    jrow = json.loads(row)
    prediction_row = pd.DataFrame.from_dict(jrow)
    prediction_df = pd.concat([prediction_df, prediction_row], ignore_index=True)    
prediction_json.close()

prediction_json = open("./datafiles/taxifare_prediction1.json", 'r')
for row in prediction_json:
    jrow = json.loads(row)
    prediction_row = pd.DataFrame.from_dict(jrow)
    prediction_df = pd.concat([prediction_df, prediction_row], ignore_index=True)    
prediction_json.close()

prediction_df.describe()
Out[63]:
predictions
count 9914.000000
mean 11.444664
std 8.859412
min 3.119109
25% 6.594593
50% 8.534064
75% 12.460252
max 102.921844
In [64]:
submission_df = test_df
submission_df['fare_amount'] = prediction_df

submission_df.head()
Out[64]:
key pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day_of_week day_of_month week month year distance_km fare_amount
0 2015-01-27 13:08:24.0000002 -73.973320 40.763805 -73.981430 40.743835 1 13 1 27 5 1 15 2.323260 10.100850
1 2015-01-27 13:08:24.0000003 -73.986862 40.719383 -73.998886 40.739201 1 13 1 27 5 1 15 2.425353 10.827985
2 2011-10-08 11:53:44.0000002 -73.982524 40.751260 -73.979654 40.746139 1 11 5 8 40 10 11 0.618628 4.794265
3 2012-12-01 21:12:12.0000002 -73.981160 40.767807 -73.990448 40.751635 1 21 5 1 48 12 12 1.961033 8.774893
4 2012-12-01 21:12:12.0000003 -73.966046 40.789775 -73.988565 40.744427 1 21 5 1 48 12 12 5.387301 16.917740
In [65]:
submission_df.to_csv('taxi-predict-v2.1.csv', columns=['key', 'fare_amount'], index=False)